[Do not merge] feat(RHOAIENG-33971): Add LM-Eval inline provider and unit tests #60

base: main
Conversation
Reviewer's Guide

This PR implements a new inline LM-Eval provider by creating LMEvalInline with full job and benchmark management, command building and execution, and result parsing; refactors the existing remote provider into a dedicated subpackage with updated imports and specs; relaxes the use_k8s configuration validation to support inline mode; and adds extensive unit tests and sample YAML for inline usage.

ER diagram for provider configuration YAMLs (inline and remote):

erDiagram
PROVIDER {
string provider_id
string provider_type
string config_class
string module
list pip_packages
}
CONFIG {
string base_url
bool use_k8s
}
PROVIDER ||--o| CONFIG : has
REMOTE_PROVIDER {
string provider_id
string provider_type
string config_class
string module
list pip_packages
}
REMOTE_CONFIG {
string url
int max_tokens
string api_token
bool tls_verify
bool use_k8s
}
REMOTE_PROVIDER ||--o| REMOTE_CONFIG : has
Class diagram for new LMEvalInline provider (inline mode):

classDiagram
class LMEvalInline {
+LMEvalEvalProviderConfig config
+dict[str, Benchmark] benchmarks
+list[Job] _jobs
+dict[str, dict[str, str]] _job_metadata
+async initialize()
+async list_benchmarks() ListBenchmarksResponse
+async get_benchmark(benchmark_id: str) Benchmark | None
+async register_benchmark(benchmark: Benchmark)
+async run_eval(benchmark_id: str, benchmark_config: BenchmarkConfig, limit="2") Job
+async build_command(task_config: BenchmarkConfig, benchmark_id: str, limit: str, stored_benchmark: Benchmark | None) list[str]
+async _run_command(cmd: list[str], job_id: str)
+_parse_lmeval_results(stdout: str) dict
+_read_results_from_json(output_path: str) dict
+async job_cancel(benchmark_id: str, job_id: str)
+async job_result(benchmark_id: str, job_id: str) EvaluateResponse
+async job_status(benchmark_id: str, job_id: str) Job | None
+async shutdown()
}
LMEvalInline --|> Eval
LMEvalInline --|> BenchmarksProtocolPrivate
LMEvalInline o-- LMEvalEvalProviderConfig
LMEvalInline o-- Benchmark
LMEvalInline o-- Job
LMEvalInline o-- EvaluateResponse
Class diagram for provider spec registration (inline and remote):

classDiagram
class InlineProviderSpec {
+api: Api
+provider_type: str
+pip_packages: list[str]
+config_class: str
+module: str
}
class RemoteProviderSpec {
+api: Api
+provider_type: str
+pip_packages: list[str]
+config_class: str
+module: str
}
class ProviderSpec
InlineProviderSpec --|> ProviderSpec
RemoteProviderSpec --|> ProviderSpec
File-Level Changes
Force-pushed from 28f34e8 to 87e3113 (Compare)
New security issues found
Force-pushed from 5e41354 to 80ebed6 (Compare)
Bandit found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Force-pushed from 80ebed6 to 09f0264 (Compare)
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- Avoid using synchronous subprocess.run inside your async _run_command method, as it will block the event loop—consider using asyncio.create_subprocess_exec or run_in_executor instead.
- You’ve marked _create_model_args, _collect_lmeval_args, and build_command as async even though they don’t await anything; converting them to regular methods will simplify the code and avoid confusion.
- The build_command method is quite large—extract sections like flag/string assembly and metadata handling into smaller helper functions to improve readability and testability.
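For reference, a minimal sketch of the non-blocking pattern suggested above, assuming the provider's existing `cmd` list and `_job_metadata` dict; it also records the child process's PID, which is relevant to Comment 1 below. Names and structure are illustrative, not the PR's final implementation:

```python
import asyncio

async def _run_command(self, cmd: list[str], job_id: str) -> tuple[str, str]:
    """Sketch of a provider method: run lm_eval without blocking the event loop."""
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # Track the child's PID (not os.getpid()) so job_cancel can signal it later.
    self._job_metadata[job_id]["process_id"] = str(process.pid)
    stdout, stderr = await process.communicate()
    return stdout.decode(), stderr.decode()
```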
## Individual Comments
### Comment 1
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:287` </location>
<code_context>
+    if os.path.exists(venv_path):
+        cmd[0] = venv_path
+
+    process_id = os.getpid()
+    self._job_metadata[job_id]["process_id"] = process_id
+    logger.info("Running lm_eval command: %s", " ".join(cmd))
</code_context>
<issue_to_address>
**issue (bug_risk):** Using os.getpid() for process tracking may not work for subprocesses.
Since evaluation runs in a subprocess, consider storing the child process's PID to ensure job cancellation works as intended.
</issue_to_address>
### Comment 2
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:291-298` </location>
<code_context>
+    logger.info("Running lm_eval command: %s", " ".join(cmd))
+
+    # Capture output for result parsing
+    result = subprocess.run(
+        cmd,
+        check=True,
+        capture_output=True,
+        text=True
+    )
+    return result.stdout, result.stderr
</code_context>
<issue_to_address>
**suggestion (bug_risk):** subprocess.run with check=True will raise on non-zero exit codes.
Handle CalledProcessError to capture stdout and stderr on failure, enabling better debugging and job metadata collection.
```suggestion
# Capture output for result parsing
try:
    result = subprocess.run(
        cmd,
        check=True,
        capture_output=True,
        text=True
    )
    return result.stdout, result.stderr
except subprocess.CalledProcessError as e:
    logger.error(
        "lm_eval command failed with exit code %d: %s",
        e.returncode,
        " ".join(cmd)
    )
    logger.error("stdout: %s", e.stdout)
    logger.error("stderr: %s", e.stderr)
    return e.stdout, e.stderr
```
</issue_to_address>
### Comment 3
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:414` </location>
<code_context>
+ benchmark_id: The ID of the benchmark to run the evaluation on.
+ job_id: The ID of the job to cancel.
+ """
+ job = next((j for j in self._jobs if j.job_id == job_id))
+ if not job:
+ logger.warning("Job %s not found", job_id)
</code_context>
<issue_to_address>
**issue (bug_risk):** next() without a default will raise if job is not found.
Provide a default value to next(), such as None, to prevent StopIteration and ensure the warning is logged.
</issue_to_address>
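A one-line sketch of the suggested fix, reusing the `_jobs` list from the snippet above:

```python
# Returns None instead of raising StopIteration when no job matches.
job = next((j for j in self._jobs if j.job_id == job_id), None)
```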
### Comment 4
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:419-420` </location>
<code_context>
+ logger.warning("Job %s not found", job_id)
+ return
+
+ if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
+ logger.warning("Job %s is not running", job_id)
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Job status check does not return after logging warning.
Return immediately after logging the warning to prevent attempts to cancel jobs that are already completed or failed.
```suggestion
if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
    logger.warning("Job %s is not running", job_id)
    return
```
</issue_to_address>
### Comment 5
<location> `src/llama_stack_provider_lmeval/inline/__init__.py:32-41` </location>
<code_context>
+
+    # Extract base_url from config if available
+    base_url = None
+    if hasattr(config, "model_args") and config.model_args:
+        for arg in config.model_args:
+            if arg.get("name") == "base_url":
+                base_url = arg.get("value")
+                logger.debug(f"Using base_url from config: {base_url}")
+                break
+
+    return LMEvalInline(config=config)
</code_context>
<issue_to_address>
**suggestion:** Extracted base_url is not used in provider instantiation.
Since base_url is only logged and not used in LMEvalInline initialization, consider removing this extraction or passing base_url if needed.
```suggestion
return LMEvalInline(config=config)
```
</issue_to_address>
### Comment 6
<location> `tests/test_lmeval_inline.py:544-545` </location>
<code_context>
+        await self.provider._run_command(cmd)
+
+
+if __name__ == "__main__":
+    unittest.main()
</code_context>
<issue_to_address>
**suggestion (testing):** Suggestion to add test coverage for job_cancel and job_result methods.
Please add unit tests for job_cancel and job_result, covering scenarios like cancelling completed jobs and retrieving results from failed jobs.
Suggested implementation:
```python
class TestJobMethods(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        # Setup provider and mock jobs
        self.provider = ...  # Replace with actual provider initialization
        self.job_id = "test_job"
        self.completed_job_id = "completed_job"
        self.failed_job_id = "failed_job"
        # Mock job states
        self.provider.jobs = {
            self.job_id: {"status": "running"},
            self.completed_job_id: {"status": "completed"},
            self.failed_job_id: {"status": "failed", "result": Exception("Job failed")},
        }

    async def test_job_cancel_running(self):
        # Cancel a running job
        result = await self.provider.job_cancel(self.job_id)
        self.assertTrue(result)
        self.assertEqual(self.provider.jobs[self.job_id]["status"], "cancelled")

    async def test_job_cancel_completed(self):
        # Try to cancel a completed job
        result = await self.provider.job_cancel(self.completed_job_id)
        self.assertFalse(result)
        self.assertEqual(self.provider.jobs[self.completed_job_id]["status"], "completed")

    async def test_job_result_success(self):
        # Simulate a successful job result
        self.provider.jobs[self.completed_job_id]["result"] = "success"
        result = await self.provider.job_result(self.completed_job_id)
        self.assertEqual(result, "success")

    async def test_job_result_failed(self):
        # Simulate a failed job result
        with self.assertRaises(Exception) as context:
            await self.provider.job_result(self.failed_job_id)
        self.assertIn("Job failed", str(context.exception))


if __name__ == "__main__":
    unittest.main()
```
- Replace `self.provider = ...` with the actual provider initialization as per your codebase.
- Ensure that `job_cancel` and `job_result` methods exist and follow the expected interface.
- Adjust job dictionary structure if your implementation differs.
</issue_to_address>
### Comment 7
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:292-297` </location>
<code_context>
result = subprocess.run(
    cmd,
    check=True,
    capture_output=True,
    text=True
)
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
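Note that the standard-library helper is `shlex.quote()` — `shlex.escape()` does not exist. Because the command is passed to `subprocess.run` as a list without `shell=True`, no shell interprets it; the audit point is where the elements of `cmd` come from. A small sketch of quoting the command for unambiguous logging, not a claim about how the PR resolves the finding:

```python
import shlex

# Quote each argument so the logged command is copy-paste safe and any
# unexpected shell metacharacters are visible rather than silently joined.
logger.info("Running lm_eval command: %s", " ".join(shlex.quote(part) for part in cmd))
```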
### Comment 8
<location> `src/llama_stack_provider_lmeval/inline/__init__.py:43` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Raise a specific error instead of the general `Exception` or `BaseException` ([`raise-specific-error`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/raise-specific-error))
<details><summary>Explanation</summary>If a piece of code raises a specific exception type
rather than the generic
[`BaseException`](https://docs.python.org/3/library/exceptions.html#BaseException)
or [`Exception`](https://docs.python.org/3/library/exceptions.html#Exception),
the calling code can:
- get more information about what type of error it is
- define specific exception handling for it
This way, callers of the code can handle the error appropriately.
How can you solve this?
- Use one of the [built-in exceptions](https://docs.python.org/3/library/exceptions.html) of the standard library.
- [Define your own error class](https://docs.python.org/3/tutorial/errors.html#tut-userexceptions) that subclasses `Exception`.
So instead of having code raising `Exception` or `BaseException` like
```python
if incorrect_input(value):
raise Exception("The input is incorrect")
```
you can have code raising a specific error like
```python
if incorrect_input(value):
raise ValueError("The input is incorrect")
```
or
```python
class IncorrectInputError(Exception):
pass
if incorrect_input(value):
raise IncorrectInputError("The input is incorrect")
```
</details>
</issue_to_address>
### Comment 9
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:48-49` </location>
<code_context>
async def get_benchmark(self, benchmark_id: str) -> Benchmark | None:
    """Get a specific benchmark by ID."""
    benchmark = self.benchmarks.get(benchmark_id)
    return benchmark
</code_context>
<issue_to_address>
**suggestion (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
```suggestion
return self.benchmarks.get(benchmark_id)
```
</issue_to_address>
### Comment 10
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:168-175` </location>
<code_context>
async def _collect_lmeval_args(self, task_config: BenchmarkConfig, stored_benchmark: Benchmark | None):
    lmeval_args = {}
    if hasattr(task_config, "lmeval_args") and task_config.lmeval_args:
        lmeval_args = task_config.lmeval_args
    if hasattr(task_config, "metadata") and task_config.metadata:
        metadata_lmeval_args = task_config.metadata.get("lmeval_args")
        if metadata_lmeval_args:
            for key, value in metadata_lmeval_args.items():
                lmeval_args[key] = value
    # Check stored benchmark for additional lmeval args
    if stored_benchmark and hasattr(stored_benchmark, "metadata") and stored_benchmark.metadata:
        benchmark_lmeval_args = stored_benchmark.metadata.get("lmeval_args")
        if benchmark_lmeval_args:
            for key, value in benchmark_lmeval_args.items():
                lmeval_args[key] = value
    return lmeval_args
</code_context>
<issue_to_address>
**issue (code-quality):** Use named expression to simplify assignment and conditional [×2] ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
</issue_to_address>
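A sketch of the named-expression form for one of the two flagged assignments, using `dict.update` in place of the explicit loop:

```python
if hasattr(task_config, "metadata") and task_config.metadata:
    if metadata_lmeval_args := task_config.metadata.get("lmeval_args"):
        lmeval_args.update(metadata_lmeval_args)
```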
### Comment 11
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:193` </location>
<code_context>
async def build_command(
    self,
    task_config: BenchmarkConfig,
    benchmark_id: str,
    limit: str,
    stored_benchmark: Benchmark | None
) -> list[str]:
    """Build lm_eval command with default args and user overrides."""
    logger.info("BUILD_COMMAND: Starting to build command for benchmark: %s", benchmark_id)
    logger.info("BUILD_COMMAND: Task config type: %s", type(task_config))
    logger.info(
        "BUILD_COMMAND: Task config has metadata: %s", hasattr(task_config, "metadata")
    )
    if hasattr(task_config, "metadata"):
        logger.info(
            "BUILD_COMMAND: Task config metadata content: %s", task_config.metadata
        )
    eval_candidate = task_config.eval_candidate
    if not eval_candidate.type == "model":
        raise LMEvalConfigError("LMEval only supports model candidates for now")
    # Create model args - use VLLM_URL environment variable for inference provider
    inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1")
    openai_url = inference_url.replace("/v1", "/v1/completions")
    model_args = await self._create_model_args(openai_url, task_config)
    if (
        stored_benchmark is not None
        and hasattr(stored_benchmark, "metadata")
        and stored_benchmark.metadata
        and "tokenizer" in stored_benchmark.metadata
    ):
        tokenizer_value = stored_benchmark.metadata.get("tokenizer")
        if isinstance(tokenizer_value, str) and tokenizer_value:
            logger.info("Using custom tokenizer from metadata: %s", tokenizer_value)
        tokenized_requests = stored_benchmark.metadata.get("tokenized_requests")
        if isinstance(tokenized_requests, bool) and tokenized_requests:
            logger.info("Using custom tokenized_requests from metadata: %s", tokenized_requests)
        model_args["tokenizer"] = tokenizer_value
        model_args["tokenized_requests"] = tokenized_requests
    task_name = self._extract_task_name(benchmark_id)
    lmeval_args = await self._collect_lmeval_args(task_config, stored_benchmark)
    # Start building the command
    cmd = ["lm_eval"]
    # Add model type (default to local-completions for OpenAI-compatible APIs)
    model_type = model_args.get("model_type", "local-completions")
    cmd.extend(["--model", model_type])
    # Build model_args string
    if model_args:
        model_args_list = []
        for key, value in model_args.items():
            if key != "model_type" and value is not None:
                model_args_list.append(f"{key}={value}")
        if model_args_list:
            cmd.extend(["--model_args", ",".join(model_args_list)])
    # Add tasks
    cmd.extend(["--tasks", task_name])
    limit = limit or "2"
    if limit:
        cmd.extend(["--limit", str(limit)])
    # Add output file for JSON results
    output_path = f"/tmp/lmeval_results_{uuid.uuid4().hex}.json"
    cmd.extend(["--output_path", output_path])
    # Add lmeval_args
    if lmeval_args:
        for key, value in lmeval_args.items():
            if value is not None:
                flag = f"--{key.replace('_', '-')}"
                cmd.extend([flag, str(value)])
    logger.info("Generated command for benchmark %s: %s", benchmark_id, " ".join(cmd))
    return cmd, output_path
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional [×2] ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Simplify logical expression using De Morgan identities ([`de-morgan`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/de-morgan/))
- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
- Low code quality found in LMEvalInline.build\_command - 25% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
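One possible decomposition along the lines Sourcery suggests; the helper names are hypothetical and the logic mirrors the snippet above:

```python
def _format_model_args(self, model_args: dict) -> list[str]:
    """Return ["--model_args", "k=v,..."] or [] when there is nothing to pass."""
    pairs = [
        f"{key}={value}"
        for key, value in model_args.items()
        if key != "model_type" and value is not None
    ]
    return ["--model_args", ",".join(pairs)] if pairs else []

def _format_extra_flags(self, lmeval_args: dict) -> list[str]:
    """Turn collected lmeval_args into --flag value pairs."""
    flags: list[str] = []
    for key, value in lmeval_args.items():
        if value is not None:
            flags.extend([f"--{key.replace('_', '-')}", str(value)])
    return flags
```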
### Comment 12
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:300` </location>
<code_context>
def _parse_lmeval_results(self, stdout: str) -> dict:
    """Parse evaluation results from stdout."""
    # Try to find JSON results in the output
    json_pattern = r'\{[^{}]*"results"[^{}]*\}'
    json_matches = re.findall(json_pattern, stdout, re.DOTALL)
    scores = {}
    if json_matches:
        try:
            # Parse JSON results
            for match in json_matches:
                result_data = json.loads(match)
                if "results" in result_data:
                    for task, metrics in result_data["results"].items():
                        for metric, value in metrics.items():
                            if isinstance(value, int | float):
                                score_key = f"{task}:{metric}"
                                scores[score_key] = ScoringResult(
                                    aggregated_results={metric: value},
                                    score_rows=[{"score": value}]
                                )
        except json.JSONDecodeError:
            logger.warning("Failed to parse JSON results from lm_eval output")
    # If no JSON found, try to parse table format
    if not scores:
        lines = stdout.split('\n')
        for i, line in enumerate(lines):
            if 'Metric' in line and '|' in line:
                # Found table header, parse subsequent lines
                for j in range(i + 2, len(lines)):  # Skip header separator
                    if not lines[j].strip() or '|' not in lines[j]:
                        break
                    parts = [p.strip() for p in lines[j].split('|') if p.strip()]
                    if len(parts) >= 5:
                        task = parts[0] or "arc_easy"  # Default task name
                        metric = parts[4]
                        try:
                            value = float(parts[6]) if len(parts) > 6 else 0.0
                            score_key = f"{task}:{metric}"
                            scores[score_key] = ScoringResult(
                                aggregated_results={metric: value},
                                score_rows=[{"score": value}]
                            )
                        except (ValueError, IndexError):
                            continue
                break
    return EvaluateResponse(generations=[], scores=scores)
</code_context>
<issue_to_address>
**issue (code-quality):** Low code quality found in LMEvalInline.\_parse\_lmeval\_results - 18% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
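A sketch of one way to split the parser as suggested; `_scores_from_json` and `_scores_from_table` are hypothetical helpers holding the two existing branches:

```python
def _parse_lmeval_results(self, stdout: str) -> EvaluateResponse:
    """Try the JSON output first, then fall back to the table format."""
    scores = self._scores_from_json(stdout) or self._scores_from_table(stdout)
    return EvaluateResponse(generations=[], scores=scores)
```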
### Comment 13
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:423-424` </location>
<code_context>
async def job_cancel(self, benchmark_id: str, job_id: str) -> None:
    """Cancel a running evaluation job.

    Args:
        benchmark_id: The ID of the benchmark to run the evaluation on.
        job_id: The ID of the job to cancel.
    """
    job = next((j for j in self._jobs if j.job_id == job_id))
    if not job:
        logger.warning("Job %s not found", job_id)
        return
    if job.status in [JobStatus.completed, JobStatus.failed, JobStatus.cancelled]:
        logger.warning("Job %s is not running", job_id)
    elif job.status in [JobStatus.in_progress, JobStatus.scheduled]:
        process_id = self._job_metadata.get(job_id, {}).get("process_id")
        if process_id:
            process_id = int(process_id)
            logger.info("Killing process %s", process_id)
            try:
                os.kill(process_id, signal.SIGTERM)
                await asyncio.sleep(2)
            except ProcessLookupError:
                logger.warning("Process %s not found", process_id)
            try:
                os.kill(process_id, signal.SIGKILL)
            except ProcessLookupError:
                logger.warning("Process %s not found", process_id)
        job.status = JobStatus.cancelled
        logger.info("Successfully cancelled job %s", job_id)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
```suggestion
if process_id := self._job_metadata.get(job_id, {}).get("process_id"):
```
</issue_to_address>
### Comment 14
<location> `src/llama_stack_provider_lmeval/inline/lmeval.py:453` </location>
<code_context>
async def job_result(self, benchmark_id: str, job_id: str) -> EvaluateResponse:
    """Get the results of a completed evaluation job.

    Args:
        benchmark_id: The ID of the benchmark to run the evaluation on.
        job_id: The ID of the job to get the results of.
    """
    job = await self.job_status(benchmark_id, job_id)
    if job is None:
        logger.warning("Job %s not found", job_id)
        return EvaluateResponse(generations=[], scores={})
    if job.status == JobStatus.completed:
        # Get results from job metadata
        job_metadata = self._job_metadata.get(job_id, {})
        results = job_metadata.get("results")
        if results:
            return results
        else:
            logger.warning("No results found for completed job %s", job_id)
            return EvaluateResponse(generations=[], scores={})
    elif job.status == JobStatus.failed:
        logger.warning("Job %s failed", job_id)
        error_msg = self._job_metadata.get(job_id, {}).get("error", "Unknown error")
        return EvaluateResponse(generations=[], scores={}, metadata={"error": error_msg})
    else:
        logger.warning("Job %s is still running", job_id)
        return EvaluateResponse(generations=[], scores={}, metadata={"status": "running"})
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Remove unnecessary else after guard condition ([`remove-unnecessary-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-else/))
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Force-pushed from eb8a389 to 867da4d (Compare)
New security issues found
Force-pushed from 867da4d to 0a2e7f0 (Compare)
Force-pushed from 60644a9 to 4596c27 (Compare)
saichandrapandraju left a comment:

LGTM! Thank you @christinaexyou, just a couple of nits from me.
On:

    self._job_metadata[job_id]["process_id"] = str(process.pid)
    ...
    stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=300)

shall we optionally parameterize timeout?
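A hedged sketch of one way to do that; the `job_timeout` config field is hypothetical:

```python
# Fall back to 300 s when the (hypothetical) job_timeout config field is unset.
timeout = getattr(self.config, "job_timeout", None) or 300
stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout)
```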
On:

        results_data, job_id
    )
    # Store the parsed results in job metadata
    self._job_metadata[job_id]["results"] = parsed_results

nit: if result size is not small, maybe we can save it in the files store and just track the results_file id.
On:

    # Create model args - use VLLM_URL environment variable for inference provider
    inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1")
    openai_url = inference_url.replace("/v1", "/v1/completions")

I think we need some checks here to make sure we're not polluting the URL. IIRC, if we need the OpenAI URL, it should end with openai/v1/completions.
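A sketch of the kind of guard being asked for; the exact expected suffix is an open question, so this only illustrates the idea:

```python
inference_url = os.environ.get("VLLM_URL", "http://localhost:8080/v1").rstrip("/")
# Only append the completions path when it is not already present,
# so a fully qualified URL is not polluted with a second suffix.
if inference_url.endswith("/v1/completions"):
    openai_url = inference_url
elif inference_url.endswith("/v1"):
    openai_url = f"{inference_url}/completions"
else:
    openai_url = f"{inference_url}/v1/completions"
```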
On:

      max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
      api_token: ${env.VLLM_API_TOKEN:=fake}
      tls_verify: ${env.VLLM_TLS_VERIFY:=false}
  eval:

nit: shall we start using the module: way of specifying the provider, as lls-core is moving away from external_providers_dir?
Summary by Sourcery
Add a new LM-Eval inline provider alongside the existing remote provider structure by relocating the remote code to a dedicated subpackage, updating provider specs and configurations, and introducing a full suite of unit tests for inline evaluation functionality.
New Features:
Enhancements:
Chores: